## carat clarity color cut price
## 1 0.51 SI2 I Very Good 774
## 2 0.93 IF H Ideal 6246
## 3 0.50 VVS2 D Very Good 1146
## 4 0.30 VS1 F Ideal 538
## 5 0.31 SI1 F Ideal 502
## 6 1.00 VS1 F Ideal 7046
## carat clarity color cut price clarity2 carat2
## 1 0.51 SI2 I Very Good 774 SI <1
## 2 0.93 IF H Ideal 6246 IF <1
## 3 0.50 VVS2 D Very Good 1146 VVS <1
## 4 0.30 VS1 F Ideal 538 VS <1
## 5 0.31 SI1 F Ideal 502 SI <1
## 6 1.00 VS1 F Ideal 7046 VS 1 - 1.9
carat: Weight of the diamond in carats (0.23 - 7.09)

clarity: Measurement of how clear the diamond is

- I1: Included. Diamonds with obvious inclusions that impact beauty.
- SI2: Slightly Included. Diamonds with inclusions detectable to a keen unaided eye, especially when viewed from the side.
- SI1: Slightly Included. Diamonds with inclusions noticeable at 10x magnification (best value).
- VS2: Very Slightly Included. Diamonds with minor inclusions that are somewhat easy to see at 10x magnification.
- VS1: Very Slightly Included. Diamonds with minor inclusions that are difficult to see.
- VVS2: Very Very Slightly Included. Diamonds with minuscule inclusions that are difficult even for trained eyes to see under 10x magnification.
- VVS1: Very Very Slightly Included. Diamonds with minuscule inclusions that are difficult even for trained eyes to see under 10x magnification.
- IF: Internally Flawless. Diamonds with no inclusions within the stone; only surface characteristics set the grade.
- FL: Flawless. Diamonds with no internal or external characteristics (rare).

color: Measurement of faint diamond color

- D: Rarest and highest quality with a pure icy look.
- E: Rarest and highest quality with a pure icy look.
- F: Rarest and highest quality with a pure icy look.
- G: No discernible color; great value for the quality.
- H: No discernible color; great value for the quality.
- I: No discernible color; great value for the quality.
- J: No discernible color; great value for the quality.

cut: Cut quality of the diamond

- Good: Roughly the top 25% of diamond cut quality. Reflects most of the light that enters, but not as much as a Very Good cut grade.
- Very Good: Roughly the top 15% of diamond cut quality. Reflects nearly as much light as the Ideal cut, but for a lower price.
- Ideal: A rare cut representing roughly the top 3% of diamond cut quality. Reflects most of the light that enters the diamond.
- Astor Ideal: These diamonds are crafted to gather and reflect the most light possible. Cut from the finest raw material (rough stones with as few impurities or inclusions as possible), they meet rigorous quality requirements and exhibit outstanding brilliance, fire, and scintillation. In addition to being graded by the GIA, all Astor by Blue Nile™ diamonds are certified by GemEx®.

price: Price in U.S. dollars ($322 - $355,403)
## `geom_smooth()` using formula = 'y ~ x'
Although many diamonds of “Very Good” and “Ideal” cuts sell for much higher prices than any “Good” diamond, on average they are less expensive within the subset of data we are evaluating. Additionally, all cut qualities show very similar price-per-carat relationships, indicating that cut is not a powerful factor in price determination.
The color D means the diamond is pure, icy, and lacking color. As the color grades move up through the alphabet, the amount of color present increases. It is clear from the plot above that the price per carat is higher for diamonds with less color than for those with more. This affirms Blue Nile's claim that the absence of color makes a diamond more expensive.
The spike in average price for diamonds of color H may be partially attributed to the fact that the H-colored diamonds in this data set were, on average, of larger carat weight than the other colors.
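This could be checked directly; the sketch below uses base R's `aggregate()` on a toy data frame (a stand-in, since the full `diamond` data is not reproduced here) to compute mean carat weight per color grade:

```r
# Toy stand-in for the diamond data; the real check would use the
# `diamond` data frame's carat and color columns
toy <- data.frame(
  color = c("D", "D", "H", "H", "I"),
  carat = c(0.30, 0.50, 0.90, 1.10, 0.40)
)

# Mean carat weight per color grade
avg_carat <- aggregate(carat ~ color, data = toy, FUN = mean)
```

Running `aggregate(carat ~ color, data = diamond, FUN = mean)` on the real data would show whether the H-colored diamonds are indeed heavier on average.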
filtered_data <- diamond %>%
filter(price <= 9000, carat <= 5)
carat_categories <- cut(filtered_data$carat, breaks = seq(0, max(filtered_data$carat), by = 0.25))
price_categories <- cut(filtered_data$price, breaks = 6)
heatmap_data <- data.frame(
carat = carat_categories,
price = price_categories
)
# Count the occurrences of each combination of carat and price
heatmap_counts <- table(heatmap_data)
# Convert the table to a data frame for plotting
heatmap_df <- as.data.frame(heatmap_counts)
# Rename the columns for better readability
names(heatmap_df) <- c("Carat", "Price", "Quantity")
# Create the heatmap using ggplot2
heatmap_plot <- ggplot(heatmap_df, aes(x = Carat, y = Price, fill = Quantity)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
labs(
title = "Heatmap of Quantity by Carat and Price",
x = "Carat Weight",
y = "Price"
) +
theme_minimal() +
# Apply the complete theme first; otherwise theme_minimal() would
# override the centered title set by theme()
theme(plot.title = element_text(hjust = 0.5))
# Display the heatmap
heatmap_plot
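The binning step above relies on base R's `cut()`, which maps a numeric vector into a factor of half-open intervals; a minimal illustration of its behavior:

```r
# cut() assigns each value to an interval (a, b]; values equal to a
# break point fall into the lower interval by default (right = TRUE)
weights <- c(0.30, 0.55, 1.20)
bins <- cut(weights, breaks = seq(0, 1.5, by = 0.25))
labels <- as.character(bins)  # e.g. "(0.25,0.5]" for 0.30
```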
ggplot(diamond, aes(x=carat,y=price)) + geom_point() + geom_smooth(method = "lm", se=FALSE) +
labs(title="Scatterplot of Price Against Carat", x="Carat", y="Price")
Here is the starting data. It appears that our errors do not have mean zero (the points sit above the line, then below, then above it) and have increasing variance (points show greater vertical spread as carat increases).
result<-lm(price~carat, data=diamond)
par(mfrow = c(2, 2))
plot(result)
We can see more clearly now that several assumptions are violated. We can see from the residual plot that the residuals do not have mean zero (the line is not parallel to the x-axis) and do not have constant variance (we do not have an even vertical spread of data points as we move from left to right). Also, from the QQ plot we can see that our data are not normally distributed. Finally, from the residuals vs. leverage plot we see that there are several points which have a Cook's distance greater than 1 and should thus be treated as influential outliers. In conclusion, we need to transform both y and x.
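Cook's distance, which the fourth diagnostic panel reports, can also be computed directly with `cooks.distance()`; a small self-contained sketch (toy data, not the diamond set):

```r
# A small regression with one point far from the rest
x <- c(1, 2, 3, 4, 10)
y <- c(1.1, 1.9, 3.2, 3.9, 20)
fit <- lm(y ~ x)

# Cook's distance for each observation; values greater than 1 are
# commonly flagged as influential
d <- cooks.distance(fit)
```

Here the fifth observation dominates, with a Cook's distance well above 1.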
result <- lm(price~carat, data=diamond)
library(MASS)
MASS::boxcox(result, lambda = seq(0.2, 0.4, 1/10))
From the Box-Cox plot, we will try the transformation \(y^* = y^{0.3}\).
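For reference, the Box-Cox procedure chooses the \(\lambda\) that maximizes the profile log-likelihood of the transformed response, where the transformation family is

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[4pt]
\log(y), & \lambda = 0.
\end{cases}
\]

Since the estimated \(\lambda\) is approximately 0.3 here, the simpler power \(y^{0.3}\) gives the same fit up to a linear rescaling.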
ystar <- diamond$price^0.3
diamond_transformed<-data.frame(diamond$carat,ystar)
ggplot(diamond_transformed, aes(x=diamond.carat,y=ystar)) + geom_point() + geom_smooth(method = "lm", se=FALSE) +
labs(title="Scatterplot of Price (Transformed) Against Carat", x="Carat", y="Price ^ 0.3")
result<-lm(ystar~diamond.carat, data=diamond_transformed)
par(mfrow = c(2, 2))
plot(result)
After transforming the response variable we can see that assumptions 1 and 2 are still not met. We can see from the residual plot that the residuals do not have mean zero (the line is not parallel to the x-axis) and do not have constant variance (we do not have an even vertical spread of data points as we move from left to right).
Let’s see if we can better fix this increasing variance. We know from the Box-Cox plot that \(\lambda < 1\), so let’s try a log transformation.
ystar <- log(diamond$price)
diamond_transformed<-data.frame(diamond$carat,ystar)
ggplot(diamond_transformed, aes(x=diamond.carat,y=ystar)) + geom_point() + geom_smooth(method = "lm", se=FALSE) +
labs(title="Scatterplot of Price (Transformed) Against Carat", x="Carat", y="log(Price)")
result<-lm(ystar~diamond.carat, data=diamond_transformed)
par(mfrow = c(2, 2))
plot(result)
This seems to have fixed the increasing variance better than the first transformation, so let's stick with \(y^* = \log(y)\). We now need to transform x as well.
Based on the scatterplot, it seems like another log transformation for x will be the best option. It seems to match our data well and will allow us to best interpret our results.
xstar<-log(diamond$carat)
diamond_final<-data.frame(xstar,ystar)
ggplot(diamond_final, aes(x=xstar,y=ystar)) + geom_point() + geom_smooth(method = "lm", se=FALSE) +
labs(title="Scatterplot of Price (Transformed) Against Carat (Transformed)", x="log(Carat)",
y="log(Price)")
result<-lm(ystar~xstar, data=diamond_final)
par(mfrow = c(2, 2))
plot(result)
It seems like we have finally satisfied the assumptions. Looking at the residual plot, we see that the line indicating the average values of the residuals is parallel to the x-axis, so assumption 1 is satisfied. We also see a more constant vertical variation as x increases, so assumption 2 is satisfied as well.
summary(result)
##
## Call:
## lm(formula = ystar ~ xstar, data = diamond_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96394 -0.17231 -0.00252 0.14742 1.14095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.521208 0.009734 875.4 <2e-16 ***
## xstar 1.944020 0.012166 159.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2761 on 1212 degrees of freedom
## Multiple R-squared: 0.9547, Adjusted R-squared: 0.9546
## F-statistic: 2.553e+04 on 1 and 1212 DF, p-value: < 2.2e-16
Our regression equation is \(\hat{y}^* = 8.521 + 1.944x^*\), where \(y^* = \log(y)\) and \(x^* = \log(x)\). We also have \(R^2 = 0.9547\). We can interpret the slope coefficient as follows: for an \(a\%\) increase in carat, the price is multiplied by approximately \((1+\frac{a}{100})^{1.944}\). This means a \(44\%\) increase in carat corresponds to a doubling in price.
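The multiplicative interpretation follows from back-transforming the fitted equation:

\[
\log(\hat{y}) = 8.521 + 1.944\,\log(x) \quad\Longrightarrow\quad \hat{y} = e^{8.521}\,x^{1.944}.
\]

Multiplying carat by \(1 + \frac{a}{100}\) therefore multiplies the predicted price by \((1+\frac{a}{100})^{1.944}\); for \(a = 44\), \(1.44^{1.944} \approx 2.03\), so the price roughly doubles.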